TargetRNA3: predicting prokaryotic RNA regulatory targets with machine learning

Small regulatory RNAs pervade prokaryotes, with the best-studied family of these non-coding genes corresponding to trans-acting regulators that bind via base pairing to their message targets. Given the increasing frequency with which these genes are being identified, it is important that methods for illuminating their regulatory targets keep pace. Using a machine learning approach, we investigate thousands of interactions between small RNAs and their targets, and we interrogate more than a hundred features indicative of these interactions. We present a new method, TargetRNA3, for predicting targets of small RNA regulators and show that it outperforms existing approaches. TargetRNA3 is available at https://cs.wellesley.edu/~btjaden/TargetRNA3.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-023-03117-2.


Supplementary Material
TargetRNA3: Predicting prokaryotic RNA regulatory targets with machine learning

Contents
• Principal component analysis
• Fig. S1. Variance in data explained by principal components
• Fig. S2. Interaction data with respect to most significant principal components
• Fig. S3. Relationships to evinced interactions of select features used by TargetRNA3
• Table S1
• Table S2
• Table S3
• Table S4
• Table S5

Principal Component Analysis

To better understand the relationships between the 111 features, we performed a principal component analysis (PCA). Instead of representing points (sRNA and candidate target pairs) by their 111 feature values, we consider points with respect to 111 principal components, i.e., the eigenspace determined from the covariance matrix. We then remove principal components in order of increasing corresponding eigenvalues. Fig. S1 shows the percent of variance explained when different numbers of principal components (from 1 to 111) are used as the dimensionality of the points is decreased. The figure shows that the features are not independent of each other; for instance, more than half of the variance in the data can be explained by fewer than 20 of the 111 principal components. This is unsurprising since so many features are related, e.g., a number of features capture some form of the strength of hybridization between the two RNA sequences.
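The variance-explained calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes NumPy is available, the function name `explained_variance_curve` is hypothetical, and a small synthetic matrix stands in for the real 111-feature data.

```python
import numpy as np

def explained_variance_curve(X):
    """Cumulative fraction of variance explained by the top-k principal components.

    X is an (n_points, n_features) array; each row represents one
    sRNA / candidate-target pair (hypothetical layout).
    """
    # Covariance matrix of the features (np.cov centers the data itself).
    cov = np.cov(X, rowvar=False)
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order;
    # reverse so the most significant component comes first.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()

# Synthetic stand-in data (an assumption): 12 features driven by 3 latent
# factors, mimicking features that are strongly related to one another.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 12)) + 0.1 * rng.normal(size=(500, 12))

curve = explained_variance_curve(X)
# Smallest number of components that explains more than half of the variance.
k_half = int(np.searchsorted(curve, 0.5, side="right")) + 1
```

When features are strongly related, as in this synthetic example, `k_half` is much smaller than the number of features, which is the same qualitative pattern the text reports for Fig. S1.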
Fig. S2 shows the data points (sRNA and candidate target pairs) projected into a space defined by the two or three most significant principal components. Visually, there does not appear to be much separation between points corresponding to interactions and points corresponding to non-interactions in these low-dimensional spaces. Ultimately, we did not further pursue the use of PCA because it requires calculating all 111 features, and some of the features are too computationally slow to calculate in real time for our purposes.
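A projection of this kind can be sketched as below. The sketch makes several assumptions: it uses NumPy, random synthetic data and labels rather than the study's data, and a hypothetical helper named `project_onto_top_components`. The centroid distance at the end is only a crude numeric stand-in for the visual inspection described in the text, not a measure used by the authors.

```python
import numpy as np

def project_onto_top_components(X, n_components=2):
    """Project rows of X onto the n_components most significant principal components."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order]

# Synthetic stand-in data (an assumption): 200 points with 6 features and
# random labels, where 1 = interaction and 0 = non-interaction.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
labels = rng.integers(0, 2, size=200)

Z = project_onto_top_components(X, n_components=2)

# Crude separation check: distance between the two class centroids
# in the projected space (small values suggest little separation).
centroid_gap = np.linalg.norm(
    Z[labels == 1].mean(axis=0) - Z[labels == 0].mean(axis=0)
)
```

Note that computing the projection requires every feature column of `X`, which is the practical cost the text cites as the reason PCA was not pursued further.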
Fig. S2. Interaction data with respect to most significant principal components.

Fig. S3. Relationships to evinced interactions of select features used by TargetRNA3.

Table S1. The table shows the phylum and class of the 13 genomes investigated in this study and the number of sRNAs from each genome.

Table S3. The table provides a more detailed description of the 118 columns from the large dataset found in Table S2. The first five columns in Table S2 identify each sRNA and candidate target. The following 111 columns in Table S2 correspond to features that may be used for predicting whether a sRNA interacts with a candidate target. The penultimate column in Table S2 contains the probability that a sRNA and candidate target interact, as predicted by TargetRNA3. The final column in Table S2 indicates whether or not there is experimental evidence that a sRNA and candidate target interact. Features in the table below whose values were computed using an existing computational tool are indicated. The runtime to calculate each feature on a genome-wide scale is described in the table below as fast (generally requiring seconds or less), medium (generally requiring minutes), or slow (generally requiring one or more hours).
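The column layout described for Table S2 (5 identifier columns, 111 feature columns, 1 predicted probability, 1 experimental-evidence indicator, 118 in total) can be illustrated with a short sketch. The function name `split_row`, the example values, and the value types are assumptions made for illustration; the actual file format and column names of Table S2 are not specified here.

```python
N_ID = 5          # identifier columns, as stated in the description
N_FEATURES = 111  # feature columns, as stated in the description
N_COLUMNS = N_ID + N_FEATURES + 2  # plus probability and evidence = 118

def split_row(row):
    """Split one 118-column row into (identifiers, features, probability, evidence)."""
    if len(row) != N_COLUMNS:
        raise ValueError(f"expected {N_COLUMNS} columns, got {len(row)}")
    ids = row[:N_ID]
    features = row[N_ID:N_ID + N_FEATURES]
    probability = row[-2]   # penultimate column: predicted interaction probability
    evidence = row[-1]      # final column: experimental evidence indicator
    return ids, features, probability, evidence

# Hypothetical row with placeholder values (not real data).
row = [f"id{i}" for i in range(N_ID)] + [0.0] * N_FEATURES + [0.87, 1]
ids, features, probability, evidence = split_row(row)
```

Checking the column count up front makes a malformed row fail loudly rather than silently shifting the feature columns.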

Table S5. The table indicates the performance of 6 different tools for predicting targets of sRNA regulation.